An interpretable latent variable model for attribute applicability in the Amazon catalogue
Abstract
Learning attribute applicability of products in the Amazon catalogue (e.g., predicting that a shoe should have a value for size, but not for battery-type) at scale is a challenge. The need for an interpretable model arises from (1) the lack of ground-truth training data, (2) the need to utilise prior information about the underlying latent space and (3) the need to understand the quality of predictions on new, unseen data. To this end, we develop the MaxMachine, a probabilistic latent variable model that learns distributed binary representations, each associated with a set of features that are likely to co-occur in the data. Layers of MaxMachines can be stacked such that higher layers encode more abstract information. Any set of variables can be clamped to encode prior information. We develop fast sampling-based posterior inference. Preliminary results show that the model improves over the baseline in 17 out of 19 product groups and provides qualitatively reasonable predictions.
1 Attribute Applicability
Many real-world datasets can be viewed as object-by-attribute matrices. A prominent example is the Amazon catalogue, which contains over 100 million products (objects) and hundreds of attributes, of which only a small subset is assigned to each product. Thus, the product-attribute assignment can be viewed as a sparse binary matrix, shown for a small subsample of the German Amazon marketplace in Fig. 1. Being able to distinguish between attributes that are truly non-applicable (e.g., battery-type for a shoe), attributes that could reasonably be applied (e.g., weight for a book), and attributes that are clearly applicable (e.g., size for a T-shirt) is crucial for applications such as attribute imputation models, data quality management, template generation, product comparison and virtually all customer-facing downstream applications.
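As a minimal illustration of the object-by-attribute view, the sketch below builds a small binary product-attribute matrix. The toy catalogue, its product names and its attribute names are hypothetical and are not taken from the Amazon data.

```python
import numpy as np

# Hypothetical toy catalogue: product -> attributes that have a value.
catalogue = {
    "shoe": {"size", "colour"},
    "book": {"title", "pages", "language"},
    "t-shirt": {"size", "colour"},
}

# Fix a row order (products) and a column order (attributes).
products = sorted(catalogue)
attributes = sorted(set().union(*catalogue.values()))

# X[n, d] = 1 if product n has a value for attribute d, else 0.
X = np.array(
    [[int(attr in catalogue[prod]) for attr in attributes] for prod in products],
    dtype=np.int8,
)

# Fraction of filled cells; in a real catalogue this would be very small.
density = X.mean()
```

A zero in this matrix is ambiguous: it may mean that the attribute is not applicable, or only that the seller did not fill it in. That ambiguity is what the applicability model has to resolve.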
We can cast the task of predicting attribute applicability as a multi-label classification problem, where each attribute constitutes a label and an arbitrary number of labels is assigned to each product. While there is recent progress in such extreme multi-label classification problems [1, 2], we face a particular challenge: the absence of reliable training labels makes it difficult to define a training metric. Therefore, we approach attribute applicability as an unsupervised problem and develop a probabilistic latent variable model that describes the generative process by which the binary product/applicability matrix is generated from a set of latent features. We aim to retain a simple, interpretable model, resembling the process of a marketplace seller who is filling in attributes for their product. The rationale behind the model is that each latent feature corresponds to a set of attributes that are likely to appear together, such as (title, pages, language, release date) or (width, height, length). Each of these sets is represented by a latent dimension, and the generative process for any product-attribute pair is a noisy disjunction of these feature sets. The model design is further
*Work was done at Amazon Berlin.
31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.
arXiv:1712.00126v2 [stat.ML] 4 Dec 2017
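The noisy disjunction of feature sets described in Section 1 can be sketched with a standard noisy-OR parameterisation. This is an illustration under stated assumptions, not the MaxMachine likelihood itself: the function names, the emission probability `lam` and the leak probability `leak` are hypothetical and do not come from the paper.

```python
import numpy as np

def noisy_or_probability(z, u, lam=0.95, leak=0.01):
    """P(x[n, d] = 1) under a noisy disjunction of latent feature sets.

    z: (N, L) binary array, z[n, l] = 1 if product n uses latent feature l.
    u: (L, D) binary array, u[l, d] = 1 if feature l contains attribute d.
    lam: assumed probability that an active feature emits its attribute.
    leak: assumed probability that an attribute appears without any
          active feature covering it.
    """
    z = np.asarray(z)
    u = np.asarray(u)
    # Number of active features of product n that contain attribute d.
    active = z @ u
    # The attribute is absent only if the leak and every active cause fail.
    return 1.0 - (1.0 - leak) * (1.0 - lam) ** active

def sample_matrix(z, u, rng, lam=0.95, leak=0.01):
    """Draw one binary product-attribute matrix from the noisy-OR model."""
    p = noisy_or_probability(z, u, lam=lam, leak=leak)
    return (rng.random(p.shape) < p).astype(np.int8)

# Attribute order: title, pages, language, release date, width, height, length.
u = np.array([
    [1, 1, 1, 1, 0, 0, 0],  # feature 0: bibliographic attributes
    [0, 0, 0, 0, 1, 1, 1],  # feature 1: physical dimensions
])
# Product 0 (printed book) uses both features; product 1 (e-book) only feature 0.
z = np.array([
    [1, 1],
    [1, 0],
])

p = noisy_or_probability(z, u)
x = sample_matrix(z, u, np.random.default_rng(0))
```

Because each latent dimension is a readable set of attributes, a predicted attribute can be traced back to the features that switched it on, which is the kind of interpretability the text argues for.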
Similar resources
A combination of semantic and attribute-based access control model for virtual organizations
A Virtual Organization (VO) consists of some real organizations with common interests, which aims to provide inter organizational associations to reach some common goals by sharing their resources with each other. Providing security mechanisms, and especially a suitable access control mechanism, which enforces the defined security policy is a necessary requirement in VOs. Since VO is a complex ...
Arithmetic Aggregation Operators for Interval-valued Intuitionistic Linguistic Variables and Application to Multi-attribute Group Decision Making
The intuitionistic linguistic set (ILS) is an extension of the linguistic variable. To overcome the drawback of using a single real number to represent the membership degree and non-membership degree for ILS, the concept of the interval-valued intuitionistic linguistic set (IVILS) is introduced in this paper by representing the membership degree and non-membership degree with intervals. The oper...
Application of Bayesian Latent Variable Model for Early Detection of Gestational Diabetes Mellitus Without A Perfect Reference Standard Test by β‐human Chorionic Gonadotropin
Background and Objectives: Gestational diabetes mellitus (GDM) is a medical problem in pregnancy, and its late diagnosis can cause adverse effects in the mother and fetus. The purpose of this research was to estimate the accuracy parameters of a biomarker for early prediction of gestational diabetes in the absence of a perfect reference standard test. Methods: This study was conducted in 52...
Creating Scalable and Interactive Web Applications Using High Performance Latent Variable Models
In this project we outline a modularized, scalable system for comparing Amazon products in an interactive and informative way using efficient latent variable models and dynamic visualization. We demonstrate how our system can build on the structure and rich review information of Amazon products in order to provide a fast, multifaceted, and intuitive comparison. By providing a condensed per-topi...
Using multivariate generalized linear latent variable models to measure the difference in event count for stranded marine animals
BACKGROUND AND OBJECTIVES: The classification of marine animals as protected species makes data and information on them very important. This has led to the need to retrieve and understand the data on event counts for stranded marine animals based on location emergence, number of individuals, behavior, and threats to their presence. Whales are g...
Journal title:
- CoRR
Volume: abs/1712.00126
Issue: -
Pages: -
Publication date: 2017